Data variation is due to different kinds of factors:
In practice, we want to estimate the effects of the primary variables as well as discover the unmodeled factors. For instance, we may want to train a model given class labels. However, the effects of the primary variables can be blurred by unmodeled factors: even data from the same class can have different distributions, or a mixture of distributions, which is also called heterogeneity.
Surrogate variable analysis (SVA) builds a linear model for the data variation and constructs surrogate variables to represent the batch effects.
The \(p \times n\) random data matrix \(X = (x_1, \dots, x_p)'\) contains \(p\) features, and each feature \(x_i = (x_{i1}, \dots, x_{in})'\) contains the expression values of the \(n\) samples. The primary variables are \(Y = (y_1, \dots, y_n)'\).
Assume that \(\mathbb{E}[x_i|Y] = b_i S(Y) = b_i S\), where \(S\) is a \(d \times n\) matrix of basis functions of the primary variables. Also assume that the marginal model of each \(e_i = x_i - b_i S\) is known or sufficiently well approximated. In matrix form, the model can be written as: \[X = B S + E\] Moreover, we assume the following decomposition: \[X = B S + \Gamma G + U\] Leek and Storey (2008) showed that this decomposition exists in the following sense. Suppose that for each \(e_i\) there is no Borel measurable function \(g\) such that \(e_i = g(e_1, \dots, e_{i - 1}, e_{i + 1}, \dots, e_p)\) almost surely. Then there exist matrices \(\Gamma_{p \times r}\), \(G_{r \times n}\) \((r \le n)\), and \(U_{p \times n}\) such that \[X = B S + \Gamma G + U,\] where the rows of \(U\) are jointly independent random vectors, so that \[\mathbb{P}(u_1, u_2, \dots, u_p) = \mathbb{P}(u_1) \times \mathbb{P}(u_2) \times \dots \times \mathbb{P}(u_p).\] Also, for all \(i = 1, \dots, p\), \(u_i \neq 0\) and \(u_i = h_i(e_i)\) for a non-random Borel measurable function \(h_i\).
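To make the decomposition concrete, the following Python sketch generates data from the model \(X = BS + \Gamma G + U\). The dimensions, basis choice, and Gaussian distributions are illustrative assumptions for this sketch, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

p, n, d, r = 100, 80, 2, 1          # features, samples, basis dim, unmodeled dim

# S: d x n basis of the primary variables (intercept + class indicator)
S = np.vstack([np.ones(n), np.repeat([0, 1], n // 2)])

B = rng.normal(size=(p, d))          # effects of the primary variables
G = rng.normal(size=(r, n))          # unmodeled factors (e.g. batch)
Gamma = rng.normal(size=(p, r))      # their effects on each feature
U = rng.normal(size=(p, n))          # rows are independent noise vectors

X = B @ S + Gamma @ G + U            # p x n data matrix
```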
When the structure of \(G\) is unknown, a set of surrogate variables \(\hat{G}\) is constructed by the SVA algorithm.
If there are no batch effects, i.e. \(\Gamma = 0\), the residual matrix \(R = X - \hat{B} S\) should not have unusually large variation in any specific direction. To estimate the dimension \(r\) of the unmodeled factors, the number of unusually large eigenvalues needs to be determined.
In SVA, a permutation test, a non-parametric approach, is applied to the residual matrix \(R\). It compares the proportion of variance explained by each singular value of \(R\) to the corresponding proportions in randomized residual matrices, where each row is permuted individually to break down any structure across rows.
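A minimal sketch of such a row-wise permutation test in Python: the function name, number of permutations, and significance threshold are our illustrative choices, not the package's implementation.

```python
import numpy as np

def n_significant_factors(R, n_perm=100, alpha=0.05, seed=0):
    """Estimate how many singular values of the residual matrix R are
    unusually large, by comparing the observed proportion of variance
    explained by each singular value to a null distribution obtained
    from row-wise permutations of R."""
    rng = np.random.default_rng(seed)
    sv = np.linalg.svd(R, compute_uv=False)
    obs = sv**2 / np.sum(sv**2)                  # observed variance proportions
    exceed = np.zeros_like(obs)
    for _ in range(n_perm):
        # permute each row independently to destroy structure across rows
        Rp = np.array([rng.permutation(row) for row in R])
        svp = np.linalg.svd(Rp, compute_uv=False)
        exceed += (svp**2 / np.sum(svp**2)) >= obs
    pvals = (exceed + 1) / (n_perm + 1)          # permutation p-value per component
    return int(np.sum(pvals < alpha))
```

A residual matrix containing one strong unmodeled factor should yield at least one significant component, while the permuted copies lose that structure.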
The SVA algorithm does not attempt to estimate the unmodeled factors \(G\) themselves. Instead, it constructs surrogate variables to represent the variation not caused by the primary variables.
Say that in the previous step \(\hat{K}\) unusually large eigenvalues are identified. For each \(\lambda_k, k = 1, \cdots, \hat{K}\), extract the corresponding right singular vector \(e_k\). In some sense, \(e_k\) can act as the surrogate variable \(\hat{g}_k\). However, when \(\hat{G}\) and \(S\) are not orthogonal, \(\hat{B}\), the estimate of the effects of \(S\), is biased, and so are \(R\) and \(e_k\).
To reduce this bias, a subset of features \(X^{(r)}\) that have a strong association with \(e_k\) is identified next. Then the model is refit with \(X^{(r)}\) to obtain the residual matrix \(R^{(r)}\). Last, identify the eigenvector \(e_{j_k}^{(r)}\) of \(R^{(r)}\) that has the largest correlation coefficient with \(e_k\).
We can choose \(e_{j_k}^{(r)}\) as \(\hat{g}_k\) or continue the iteration. This approach up-weights the features that show a strong association with \(G\) and down-weights the features that show a strong association with \(S\).
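One refinement step of this construction can be sketched as follows. The function name, the fixed subset size `top`, and the use of absolute correlation as the association measure are our illustrative assumptions, not the package's exact procedure:

```python
import numpy as np

def refine_surrogate(X, S, e_k, top=20):
    """One refinement step for a candidate surrogate variable e_k
    (a right singular vector of the residual matrix): refit on the
    features most associated with e_k and return the right singular
    vector of the reduced residual matrix best correlated with e_k."""
    # residual-forming projection for the primary-variable model X = B S + E
    P = np.eye(S.shape[1]) - S.T @ np.linalg.pinv(S @ S.T) @ S
    # association of each feature with e_k: absolute correlation
    assoc = np.abs(np.corrcoef(X, e_k[None, :])[:-1, -1])
    idx = np.argsort(assoc)[-top:]               # strongly associated subset
    R_sub = X[idx] @ P                           # residuals of the reduced model
    _, _, Vt = np.linalg.svd(R_sub, full_matrices=False)
    # pick the right singular vector most correlated with e_k
    cors = [abs(np.corrcoef(v, e_k)[0, 1]) for v in Vt]
    return Vt[int(np.argmax(cors))]
```

In practice one would either accept the returned vector as \(\hat{g}_k\) or feed it back in as the new \(e_k\) and iterate.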
We want to understand whether SVA is robust against mean heterogeneity. We are also interested in the distribution of the eigenvalues of the heterogeneous data. Since SVA is related to PCA, we want to know how much information about the heterogeneity can be captured by PCA as well.
In our simulation, there are 80 samples belonging to two classes: the first 40 samples belong to Class 1 and the others belong to Class 2. The samples are collected from two batches: Batch 1 and Batch 2.
Each sample has 100 features:
Thus, in the simulation, the heterogeneity lies in a one-dimensional space.
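As a concrete illustration of this setup, the following Python sketch generates such a dataset. The effect sizes, the noise level, and the reading of \(\pi_1\) (\(\pi_2\)) as the fraction of Class 1 (Class 2) samples in Batch 1 are our assumptions for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 80, 100
cls = np.repeat([1, 2], 40)                   # first 40 samples in Class 1

# pi1 (pi2): assumed fraction of Class 1 (Class 2) samples in Batch 1;
# the values below correspond to the balanced case
pi1, pi2 = 0.5, 0.5
in_batch1 = np.concatenate([rng.random(40) < pi1, rng.random(40) < pi2])

class_effect = np.where(cls == 2, 1.0, 0.0)   # primary (class) mean shift
batch_effect = np.where(in_batch1, 0.0, 1.0)  # one-dimensional batch mean shift

# each feature: baseline noise + class effect + batch effect
X = rng.normal(size=(p, n)) + class_effect + batch_effect
```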
In this case, \(\pi_1 = \pi_2 = 1\). This means that all the samples come from Batch 1, and therefore there is no batch effect in the data. This case serves as a baseline for understanding the behavior of the eigenvalues in the analysis.
## [1] "The number of surrogate variables: 0"
## [1] "Manually setting the number of surrogate variables to 1 to apply the sva algorithm"
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5
In this case, \(\pi_1 = \pi_2 = 0.5\). This indicates balance in the following two ways:
## [1] "The number of surrogate variables: 0"
## [1] "Manually setting the number of surrogate variables to 1 to apply the sva algorithm"
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5
#### Unbalanced Mean Heterogeneity Data
In order to better understand the behavior of SVA against heterogeneity, we design three cases of unbalanced batch effects.
In this case, it indicates:
By the symmetry of Batch 1 and Batch 2, we only need to consider the case \(\pi_1 = \pi_2\), \(\pi_1 + \pi_2 < 1\). We run the simulation for two sets of \(\pi_1\) and \(\pi_2\):
## [1] "The number of surrogate variables: 0"
## [1] "Manually setting the number of surrogate variables to 1 to apply the sva algorithm"
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5
## [1] "The number of surrogate variables: 0"
## [1] "Manually setting the number of surrogate variables to 1 to apply the sva algorithm"
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5
In this case, it indicates:
By the symmetry of Batch 1 and Batch 2, we only need to consider the case \(\pi_1 > \pi_2\), \(\pi_1 + \pi_2 = 1\). We run the simulation for two sets of \(\pi_1\) and \(\pi_2\):
## [1] "The number of surrogate variables: 0"
## [1] "Manually setting the number of surrogate variables to 1 to apply the sva algorithm"
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5
## [1] "The number of surrogate variables: 0"
## [1] "Manually setting the number of surrogate variables to 1 to apply the sva algorithm"
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5
In this case, it indicates:
By the symmetry of Batch 1 and Batch 2, we only need to consider the case \(\pi_1 > \pi_2\), \(\pi_1 + \pi_2 < 1\). We run the simulation for two sets of \(\pi_1\) and \(\pi_2\):
## [1] "The number of surrogate variables: 0"
## [1] "Manually setting the number of surrogate variables to 1 to apply the sva algorithm"
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5
## [1] "The number of surrogate variables: 0"
## Number of significant surrogate variables is: 1
## Iteration (out of 5 ):1 2 3 4 5